1 Introduction
This HTML notebook will replicate the STATA do file named vnm_mics14_dp2019 for distribution in R. The goal is to create an R script that does what the said do file does.
I will follow the same section numbering as in the do file for ease of comparision.
2 Libraries
Before we go into the excercise, following are the packages from which we will use various functions.
## here() starts at D:/R projects/OPHI/Vietnam Translation/Vietnam-MICS-MPI-to-R
## -- Attaching packages ------------
## v ggplot2 3.3.0 v purrr 0.3.4
## v tibble 3.0.1 v dplyr 0.8.5
## v tidyr 1.0.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## Warning: package 'ggplot2' was built under R version 3.6.3
## Warning: package 'tibble' was built under R version 3.6.3
## Warning: package 'purrr' was built under R version 3.6.3
## Warning: package 'dplyr' was built under R version 3.6.3
## Warning: package 'forcats' was built under R version 3.6.3
## -- Conflicts ---------------------
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## Warning: package 'janitor' was built under R version 3.6.3
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
## Warning: package 'gt' was built under R version 3.6.3
3 Building a Well-Being Index from Vietnam MICS 2013-14
The memory and environment clearing commands and the commands for setting working folders and paths are not needed if one is working in Project mode through RStudio and uses the {here} (read as package here). This is a library for managing paths and directories.
4 Vietnam MICS 2014
4.1 Step 1: Data Preparation
__ Selecting main variables from CH, WM, HH & MN recode & merging with HL recode __
It should be noted that anthropometric data was not collected for children under 5 as part of the Viet Nam MICS 2014 dataset. Previously, nutrition data was collected as part of Viet Nam MICS 2011. However, the data was not collected in this round due to time and resource constraints as well as the availability of national nutrition survey data (p.61)
Above comments are copied from the STATA do file
4.1.1 Step 1.1: CH - Children’s Recode (Under 5)
According to the STATA do file there is no data for this section.
4.1.2 Step 1.2: BH - Birth Recode (All females 15-49 years who ever gave birth)
The purpose of step 1.2 is to identify children of any age who died in the last 5 years prior to the survey date. As seen in the STATA file
Loading the data from bh.sav.
bh_dat <- read_sav(file = here("Viet Nam_MICS5_Datasets",
"Viet Nam MICS 2013-14 SPSS Datasets",
"bh.sav"))
bh_dat <- clean_names(bh_dat)The above code chunk loads the bh.sav and names it bh_dat(a data object). The clean names function gets all the variable names in lower snake case. IN case if anyone is wondering what are all the possible cases, please refere to the wonderful art by Allison Horst shown below.
Various Cases (Let me know your favourite)
Now let us take a glimplse at the data and names of the variables.
## [1] "hh1" "hh2" "ln" "bhln" "bh2" "bh3"
## [7] "bh4m" "bh4y" "bh5" "bh6" "bh7" "bh8"
## [13] "bh9u" "bh9n" "bh10" "bh4c" "bh4f" "bh9c"
## [19] "bh9f" "hh6" "hh7" "wdoi" "wdob" "ethnicity"
## [25] "welevel" "brthord" "magebrt" "birthint" "wmweight" "wscore"
## [31] "windex5" "wscoreu" "windex5u" "wscorer" "windex5r" "windex2"
| hh1 | hh2 | ln | bhln | bh2 | bh3 | bh4m | bh4y | bh5 | bh6 | bh7 | bh8 | bh9u | bh9n | bh10 | bh4c | bh4f | bh9c | bh9f | hh6 | hh7 | wdoi | wdob | ethnicity | welevel | brthord | magebrt | birthint | wmweight | wscore | windex5 | wscoreu | windex5u | wscorer | windex5r | windex2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 2 | 1 | 1 | 1 | 4 | 2011 | 1 | 2 | 1 | 4 | NA | NA | NA | 1336 | 1 | NA | NA | 1 | 1 | 1368 | 1044 | 1 | 4 | 1 | 2 | 1 | 2.167035 | 1.640778 | 5 | 1.337577 | 5 | NA | NA | 2 |
| 1 | 2 | 2 | 2 | 1 | 1 | 5 | 2013 | 1 | 0 | 1 | 5 | NA | NA | 2 | 1361 | 1 | NA | NA | 1 | 1 | 1368 | 1044 | 1 | 4 | 2 | 2 | 2 | 2.167035 | 1.640778 | 5 | 1.337577 | 5 | NA | NA | 2 |
| 1 | 3 | 2 | 1 | 1 | 2 | 3 | 2003 | 1 | 10 | 1 | 4 | NA | NA | NA | 1239 | 1 | NA | NA | 1 | 1 | 1368 | 939 | 1 | 3 | 1 | 2 | 1 | 2.167035 | 1.427797 | 5 | 1.071892 | 5 | NA | NA | 2 |
| 1 | 3 | 2 | 2 | 1 | 1 | 6 | 2007 | 1 | 6 | 1 | 5 | NA | NA | 2 | 1290 | 1 | NA | NA | 1 | 1 | 1368 | 939 | 1 | 3 | 2 | 2 | 4 | 2.167035 | 1.427797 | 5 | 1.071892 | 5 | NA | NA | 2 |
| 1 | 4 | 2 | 1 | 1 | 2 | 11 | 2007 | 1 | 6 | 1 | 4 | NA | NA | NA | 1295 | 1 | NA | NA | 1 | 1 | 1368 | 1008 | 1 | 4 | 1 | 2 | 1 | 2.167035 | 1.613568 | 5 | 1.303632 | 5 | NA | NA | 2 |
| 1 | 6 | 2 | 1 | 1 | 1 | 3 | 1995 | 1 | 18 | 1 | 3 | NA | NA | NA | 1143 | 1 | NA | NA | 1 | 1 | 1368 | 788 | 1 | 4 | 1 | 2 | 1 | 2.167035 | 1.624926 | 5 | 1.317802 | 5 | NA | NA | 2 |
Now let us create variables, only those that are non grouped, i.e. thhose generated without using bysort. In R one can create grouped varibales in a different manner. Will show that in the code chunk following this one.
bh_dat <- bh_dat %>%
mutate(
ind_id = structure(((hh1*100000)+(hh2*100)+ln),
label = "Individual ID"),
date_death = structure(bh4c + bh9c,
label = "Date of Death CMC"),
mdead_survey = structure(ifelse(((bh9c == 0 |is.na(bh9c)) & bh5 == 1),
NA,
wdoi - date_death),
label = "Months Dead from Survey"),
ydead_survey = structure(mdead_survey/12,
label = "Years dead from Survey"),
age_death = structure(ifelse(bh5 == 2, bh9c, NA),
label = "Age at death in Months"),
child_died = structure(forcats::as_factor(
ifelse(is.na(bh5),NA,
ifelse(bh5==1,"child is alive","child is dead")))),
child18_died = forcats::as_factor(
ifelse((!is.na(age_death) & age_death > 216),
ifelse(is.na(bh5),NA,
ifelse(bh5==1,
"child is alive",
"child is dead")),
"child is alive"))
)Creating Tables for new variables check.
| age_death | n | percent | valid_percent |
|---|---|---|---|
| 0 | 242 | 1.563408e-02 | 0.455743879 |
| 1 | 33 | 2.131921e-03 | 0.062146893 |
| 2 | 17 | 1.098262e-03 | 0.032015066 |
| 3 | 17 | 1.098262e-03 | 0.032015066 |
| 4 | 9 | 5.814329e-04 | 0.016949153 |
| 5 | 5 | 3.230183e-04 | 0.009416196 |
| 6 | 14 | 9.044512e-04 | 0.026365348 |
| 7 | 6 | 3.876219e-04 | 0.011299435 |
| 8 | 10 | 6.460366e-04 | 0.018832392 |
| 9 | 2 | 1.292073e-04 | 0.003766478 |
| 11 | 5 | 3.230183e-04 | 0.009416196 |
| 12 | 23 | 1.485884e-03 | 0.043314501 |
| 14 | 5 | 3.230183e-04 | 0.009416196 |
| 18 | 3 | 1.938110e-04 | 0.005649718 |
| 19 | 1 | 6.460366e-05 | 0.001883239 |
| 20 | 2 | 1.292073e-04 | 0.003766478 |
| 23 | 4 | 2.584146e-04 | 0.007532957 |
| 24 | 22 | 1.421280e-03 | 0.041431262 |
| 36 | 27 | 1.744299e-03 | 0.050847458 |
| 48 | 14 | 9.044512e-04 | 0.026365348 |
| 60 | 6 | 3.876219e-04 | 0.011299435 |
| 72 | 9 | 5.814329e-04 | 0.016949153 |
| 84 | 3 | 1.938110e-04 | 0.005649718 |
| 96 | 5 | 3.230183e-04 | 0.009416196 |
| 108 | 6 | 3.876219e-04 | 0.011299435 |
| 120 | 1 | 6.460366e-05 | 0.001883239 |
| 132 | 1 | 6.460366e-05 | 0.001883239 |
| 144 | 3 | 1.938110e-04 | 0.005649718 |
| 156 | 1 | 6.460366e-05 | 0.001883239 |
| 168 | 3 | 1.938110e-04 | 0.005649718 |
| 180 | 1 | 6.460366e-05 | 0.001883239 |
| 192 | 7 | 4.522256e-04 | 0.013182674 |
| 204 | 3 | 1.938110e-04 | 0.005649718 |
| 216 | 7 | 4.522256e-04 | 0.013182674 |
| 228 | 4 | 2.584146e-04 | 0.007532957 |
| 240 | 3 | 1.938110e-04 | 0.005649718 |
| 252 | 1 | 6.460366e-05 | 0.001883239 |
| 264 | 2 | 1.292073e-04 | 0.003766478 |
| 276 | 3 | 1.938110e-04 | 0.005649718 |
| 312 | 1 | 6.460366e-05 | 0.001883239 |
| NA | 14948 | 9.656955e-01 | NA |
## Warning: Column `bh5` has different attributes on LHS and RHS of join
| bh5 | child is alive | child is dead |
|---|---|---|
| 1 | 14948 | 0 |
| 2 | 0 | 531 |
| child18_died | n | percent |
|---|---|---|
| child is alive | 15465 | 0.9990955488 |
| child is dead | 14 | 0.0009044512 |
Creating grouped variables.
Notice that I need not write equivalent R code for the following STATA code chunks:
1> This has been included in the grouped condition for the variable
replace tot_child18_died_5y=. if child18_died==1 & ydead_survey==. //Replace as ‘.’ if there is no information on when the child died
2> This need not be written as the structure of data is different, only one rwo exists per ind_id NOte that due to this feature there is no need to create “child18u_died_per_wom_5y” as it is same as “tot_child18_died_5y”
bysort ind_id: egen childu18_died_per_wom_5y = max(tot_child18_died_5y) lab var childu18_died_per_wom_5y “Total child under 18 death for each women in the last 5 years (birth recode)”
3> This need not be written as the structure of data is different, only one rwo exists per ind_id
//Keep one observation per women bysort ind_id: gen id=1 if _n==1 keep if id==1 drop id duplicates report ind_id
bh_dat %>%
group_by(ind_id) %>%
summarise(
tot_child_died = sum(child_died == "child is dead", na.rm = T),
tot_child18_died_5y = sum(
child18_died[ydead_survey <= 5 & !is.na(ydead_survey)] == "child is dead", na.rm = T),
women_BH = 1
) %>%
mutate(
tot_child18_died_5y = ifelse(
(is.na(tot_child18_died_5y) & tot_child_died>=0 & !is.na(tot_child_died)),0,tot_child18_died_5y)
)-> grouped_bh_datCreating variables for new grouped variables
## tot_child_died 0 1
## 0 6647 0
## 1 382 7
## 2 55 0
## 3 6 2
## 4 2 0
Writing new files.
4.1.3 Step 1.3: WM - Women’s Record (All eligible females 15-49 in the household)
Loading data and setting case for variable (column) names.
wm_dat <- read_sav(file = here("Viet Nam_MICS5_Datasets",
"Viet Nam MICS 2013-14 SPSS Datasets",
"wm.sav"))
wm_dat <- clean_names(wm_dat)Mutating the data for creating new variables. This is what one would refere to generating new varibles in STATA, I think.
Note, for the varible marital or marital_wom, I miss to see whyn the STATA code defines it as marital to begin with and renames it to marital_wom. I will define this varibale as marital_wom from the begining itself. Kindly bring to notice if this is incorrect.
wm_dat <- wm_dat %>%
mutate(
ind_id = structure(((hh1*100000)+(hh2*100)+ln),
label = "Individual ID"),
women_wm = 1,
marital_wom = structure(
forcats::as_factor(
case_when(
mstatus == 3 & is.na(ma6) ~ "Never Married",
mstatus == 1 & is.na(ma6) ~ "Currently Married",
mstatus == 2 & ma6 == 1 ~ "Widowed",
mstatus == 2 & ma6 == 2 ~ "Divorced",
mstatus == 2 & ma6 == 3 ~ "Seperated/Not living together",
)
), label = "Marital Status of Household Member"
)
) While creating new variables the STATA code has various tables. I will generate those here.
## [1] TRUE
This means all Individual Ids are unique, i.e., every row represents a unique individual.
| cm1 | 1 | 2 | NA_ |
|---|---|---|---|
| 1 | 454 | 6647 | 0 |
| 2 | 0 | 2726 | 0 |
| NA | 0 | 0 | 363 |
## # A tibble: 10 x 2
## mstatus ma6
## <dbl+lbl> <dbl+lbl>
## 1 1 [Currently married/in union] NA
## 2 1 [Currently married/in union] NA
## 3 1 [Currently married/in union] NA
## 4 3 [Never married/in union] NA
## 5 3 [Never married/in union] NA
## 6 3 [Never married/in union] NA
## 7 3 [Never married/in union] NA
## 8 1 [Currently married/in union] NA
## 9 1 [Currently married/in union] NA
## 10 3 [Never married/in union] NA
| mstatus | 1 | 2 | 3 | NA_ |
|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 6972 |
| 2 | 207 | 193 | 107 | 0 |
| 3 | 0 | 0 | 0 | 2348 |
| NA | 0 | 0 | 0 | 363 |
| marital_wom | n | percent | valid_percent |
|---|---|---|---|
| Currently Married | 6972 | 0.68420020 | 0.70947390 |
| Never Married | 2348 | 0.23042198 | 0.23893355 |
| Widowed | 207 | 0.02031403 | 0.02106441 |
| Divorced | 193 | 0.01894014 | 0.01963977 |
| Seperated/Not living together | 107 | 0.01050049 | 0.01088837 |
| NA | 363 | 0.03562316 | NA |
| ma6 | Currently Married | Never Married | Widowed | Divorced | Seperated/Not living together | NA_ |
|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 207 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 193 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 107 | 0 |
| NA | 6972 | 2348 | 0 | 0 | 0 | 363 |
| mstatus | Currently Married | Never Married | Widowed | Divorced | Seperated/Not living together | NA_ |
|---|---|---|---|---|---|---|
| 1 | 6972 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 207 | 193 | 107 | 0 |
| 3 | 0 | 2348 | 0 | 0 | 0 | 0 |
| NA | 0 | 0 | 0 | 0 | 0 | 363 |
Keeping relevant columns.
Writing the tables in csv.
4.1.4 Step 1.4: MR - MEn’s Recode (All eligible men in the household)
As mentioned in the STATA code, Note: There is no male recode file for Viet Nam MICS 2014. Hence the commands under this section have been remove
4.1.5 Step 1.5 HH - Household Record (All Households intereviewed)
Load data.
hh_dat <- read_sav(file = here("Viet Nam_MICS5_Datasets",
"Viet Nam MICS 2013-14 SPSS Datasets",
"hh.sav"))
hh_dat <- clean_names(hh_dat)Creating new variables.
write data in csv.
4.1.6 step 1.6: HL - Household Member
Load data.
hl_dat <- read_sav(file = here("Viet Nam_MICS5_Datasets",
"Viet Nam MICS 2013-14 SPSS Datasets",
"hl.sav"))
hl_dat <- clean_names(hl_dat)Creating new variables.
4.1.7 Step 1.7: Data Merging
Unlike the STATA code the merging here will look a little different. It will be in a single flow. However, checks will be shown prior to mrunnig the merge command.
hl_dat %>%
left_join(grouped_bh_dat, by = c("ind_id" = "ind_id")) %>%
left_join(wm_dat, by = c("ind_id" = "ind_id")) %>%
tabyl(hl7, show_na = T)## hl7 n percent
## 0 29190 0.7412392077
## 1 1065 0.0270441849
## 2 4650 0.1180802438
## 3 2037 0.0517267649
## 4 1479 0.0375571356
## 5 541 0.0137379380
## 6 228 0.0057897410
## 7 110 0.0027932961
## 8 38 0.0009649568
## 9 27 0.0006856272
## 10 9 0.0002285424
## 11 3 0.0000761808
## 13 2 0.0000507872
## 15 1 0.0000253936
hl_dat %>%
left_join(grouped_bh_dat, by = c("ind_id" = "ind_id")) %>%
left_join(wm_dat, by = c("ind_id" = "ind_id")) %>%
mutate(temp = hl7 > 0) %>%
tabyl(women_wm, temp, show_na = T)## women_wm FALSE TRUE
## 1 0 10190
## NA 29190 0
hl_dat %>%
left_join(grouped_bh_dat, by = c("ind_id" = "ind_id")) %>%
left_join(wm_dat, by = c("ind_id" = "ind_id")) %>%
mutate(temp = hl7 > 0) %>%
filter((temp == 1 & is.na(women_wm))) %>%
tabyl(wm7, show_na = T)## [1] wm7 n percent
## <0 rows> (or 0-length row.names)
hl_dat %>%
left_join(grouped_bh_dat, by = c("ind_id" = "ind_id")) %>%
left_join(wm_dat, by = c("ind_id" = "ind_id")) %>%
left_join(hh_dat, by = c("hh_id" = "hh_id")) %>%
filter(hh9 ==1) %>% # keeping only households that completed interview
mutate(
marital_men = structure(NA, label = "Marital Status of household member")
) -> merged_datWriting this object for safe keeping.